AITopics | reliability diagram

Collaborating Authors

reliability diagram

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Calibration without labels in multiple testing

Wadekar, Adway S., Soloff, Jake A.

arXiv.org Machine LearningJun-19-2026

Large-scale hypothesis testing supports probability claims about individual hypotheses, as in empirical Bayes methods for estimating local false discovery rates. We study how such claims can be interpreted as approximately calibrated forecasts of the null hypothesis, yielding interpretable error probabilities even under model misspecification. Our approach draws conceptual inspiration from probabilistic forecasting but addresses a different challenge: unlike forecasting, where labels are eventually observed, in multiple testing the ground truth is never revealed, so calibration must be assessed stochastically and established indirectly. We address this challenge by constructing a set of pseudo-labels, derived from the spacings of ordered $p$-values, which have the local false discovery rate as their regression target. Our construction unlocks existing tools for assessing and performing post-hoc calibration in multiple testing. Notably, we find on a large-scale empirical survey of published psychology and neuroscience literature that the $q$-value, a popular error measure based on the false discovery rate, can be severely miscalibrated.

artificial intelligence, lfdr, machine learning, (19 more...)

arXiv.org Machine Learning

2606.19737

Country:

North America > United States > Michigan (0.40)
North America > United States > California (0.28)

Genre: Research Report (0.67)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Add feedback

A Semantic-Sampling Framework for Evaluating Calibration in Open-Ended Question Answering

Wang, Zhanliang, Xiao, Jiancong, Jin, Ruochen, Yang, Shu, Hou, Bojian, Shen, Li

arXiv.org Machine LearningMay-12-2026

Calibration measures whether a model's predicted confidence aligns with its empirical accuracy, and is central to the reliable deployment of large language models (LLMs) in high-stakes domains such as medicine and law. While much recent work focuses on improving LLM calibration, the equally important question of how to evaluate it in realistic settings remains underdeveloped. Open-ended question answering (QA), the most common deployment setting for modern LLMs, is where existing evaluation methods fall short: logit-based metrics need restricted output formats and internal probabilities; verbalized confidence is self-reported and often overconfident; and sampling-based methods rely on task-specific extraction rules without a clear finite-sample target. We introduce Sem-ECE (Semantic-Sampling Expected Calibration Error), a calibration evaluation framework for open-ended QA that samples answers from the model, groups them into semantic classes, and uses the resulting frequencies as confidence. We study two estimators within this framework: Sem$_1$-ECE, the same-sample self-consistency score, and Sem$_2$-ECE, a held-out variant that separates answer selection from confidence evaluation. We prove both are asymptotically unbiased, and further show that they agree on easy questions but diverge on hard ones with Sem$_2$ achieving strictly smaller calibration error, so their gap also serves as a diagnostic for question difficulty. Experiments on three open-ended QA benchmarks across five leading commercial LLMs match our theoretical predictions and show that Sem-ECE outperforms verbalized confidence and existing sampling-based methods, while complementing logit-based evaluation when internal probabilities are unavailable.

large language model, machine learning, natural language, (21 more...)

arXiv.org Machine Learning

2605.08432

Country: North America > United States > Pennsylvania (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Ensemble-Based Dirichlet Modeling for Predictive Uncertainty and Selective Classification

Franzen, Courtney, Pourkamali-Anaraki, Farhad

arXiv.org Machine LearningApr-8-2026

Neural network classifiers trained with cross-entropy loss achieve strong predictive accuracy but lack the capability to provide inherent predictive uncertainty estimates, thus requiring external techniques to obtain these estimates. In addition, softmax scores for the true class can vary substantially across independent training runs, which limits the reliability of uncertainty-based decisions in downstream tasks. Evidential Deep Learning aims to address these limitations by producing uncertainty estimates in a single pass, but evidential training is highly sensitive to design choices including loss formulation, prior regularization, and activation functions. Therefore, this work introduces an alternative Dirichlet parameter estimation strategy by applying a method of moments estimator to ensembles of softmax outputs, with an optional maximum-likelihood refinement step. This ensemble-based construction decouples uncertainty estimation from the fragile evidential loss design while also mitigating the variability of single-run cross-entropy training, producing explicit Dirichlet predictive distributions. Across multiple datasets, we show that the improved stability and predictive uncertainty behavior of these ensemble-derived Dirichlet estimates translate into stronger performance in downstream uncertainty-guided applications such as prediction confidence scoring and selective classification.

artificial intelligence, machine learning, prediction, (18 more...)

arXiv.org Machine Learning

2604.06032

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Colorado > Denver County > Denver (0.04)
North America > United States > New York > New York County > New York City (0.04)
Asia (0.04)

Genre: Research Report > New Finding (0.93)

Industry: Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

sidedCalibrationTheorem

Neural Information Processing SystemsFeb-10-2026, 07:18:42 GMT

Theorem 2. Suppose that the predictive distribution Q has the sufficient ability to approximate the true unknown distribution P, given data is i.i.d. Lm(P,Q) = 0 if and only if P = Q when F is a unit ball in a universal RKHS [13]. Becausetheconfidencelevelp2 p1 is exactly equal to the proportion of samples {y1,,yn} covered by the two-sided prediction interval. B.1 Baselines MC-Dropout (MCD) [12]: A variant of standard dropout, named as Monte-Carlo Dropout. Heteroscedastic Neural Network (HNN) [17]: In this approach, similar to a heteroscedastic regression, the network has two outputs in the last layer, corresponding to the predicted mean and variance for each input xi.

artificial intelligence, dataset, machine learning, (17 more...)

Neural Information Processing Systems

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.05)
North America > United States > District of Columbia > Washington (0.05)
Asia > China > Beijing > Beijing (0.05)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.36)

Add feedback

Deep Linear Discriminant Analysis Revisited

Tezekbayev, Maxat, Takhanov, Rustem, Bolatov, Arman, Assylbekov, Zhenisbek

arXiv.org Machine LearningJan-6-2026

We show that for unconstrained Deep Linear Discriminant Analysis (LDA) classifiers, maximum-likelihood training admits pathological solutions in which class means drift together, covariances collapse, and the learned representation becomes almost non-discriminative. Conversely, cross-entropy training yields excellent accuracy but decouples the head from the underlying generative model, leading to highly inconsistent parameter estimates. To reconcile generative structure with discriminative performance, we introduce the \emph{Discriminative Negative Log-Likelihood} (DNLL) loss, which augments the LDA log-likelihood with a simple penalty on the mixture density. DNLL can be interpreted as standard LDA NLL plus a term that explicitly discourages regions where several classes are simultaneously likely. Deep LDA trained with DNLL produces clean, well-separated latent spaces, matches the test accuracy of softmax classifiers on synthetic data and standard image benchmarks, and yields substantially better calibrated predictive probabilities, restoring a coherent probabilistic interpretation to deep discriminant models.

accuracy, artificial intelligence, machine learning, (17 more...)

arXiv.org Machine Learning

2601.01619

Country:

North America > United States (0.28)
Asia > Kazakhstan (0.28)
Asia > Middle East > UAE (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Discriminant Analysis (0.60)

Add feedback

Investigating the Multilingual Calibration Effects of Language Model Instruction-Tuning

Huang, Jerry, Lu, Peng, Zeng, Qiuhao, Iwasawa, Yusuke, Matsuo, Yutaka, Chandar, Sarath, Marrese-Taylor, Edison, Li, Irene

arXiv.org Machine LearningJan-6-2026

Ensuring that deep learning models are well-calibrated in terms of their predictive uncertainty is essential in maintaining their trustworthiness and reliability, yet despite increasing advances in foundation model research, the relationship between such large language models (LLMs) and their calibration remains an open area of research. In this work, we look at a critical gap in the calibration of LLMs within multilingual settings, in an attempt to better understand how the data scarcity can potentially lead to different calibration effects and how commonly used techniques can apply in these settings. Our analysis on two multilingual benchmarks, over 29 and 42 languages respectively, reveals that even in low-resource languages, model confidence can increase significantly after instruction-tuning on high-resource language SFT datasets. However, improvements in accuracy are marginal or non-existent, resulting in mis-calibration, highlighting a critical shortcoming of standard SFT for multilingual languages. Furthermore, we observe that the use of label smoothing to be a reasonable method alleviate this concern, again without any need for low-resource SFT data, maintaining better calibration across all languages. Overall, this highlights the importance of multilingual considerations for both training and tuning LLMs in order to improve their reliability and fairness in downstream use.

large language model, machine learning, natural language, (21 more...)

arXiv.org Machine Learning

2601.01362

Country:

North America > United States (1.00)
Europe (0.92)
Asia > Middle East > UAE (0.45)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

An exact multiple-time-step variational formulation for the committor and the transition rate

Lorpaiboon, Chatipat, Weare, Jonathan, Dinner, Aaron R.

arXiv.org Artificial IntelligenceDec-9-2025

For a transition between two stable states, the committor is the probability that the dynamics leads to one stable state before the other. It can be estimated from trajectory data by minimizing an expression for the transition rate that depends on a lag time. We show that an existing such expression is minimized by the exact committor only when the lag time is a single time step, resulting in a biased estimate in practical applications. We introduce an alternative expression that is minimized by the exact committor at any lag time. The key idea is that, when trajectories enter the stable states, the times that they enter (stopping times) must be used for estimating the committor and transition rate instead of the lag time. Numerical tests on benchmark systems demonstrate that our committor and transition rate estimates are much less sensitive to the choice of lag time. We show how further accuracy for the transition rate can be achieved by combining results from two lag times. We also relate the transition rate expression to a variational approach for kinetic statistics based on the mean-squared residual and discuss further numerical considerations with the aid of a decomposition of the error into dynamic modes.

artificial intelligence, committor, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2509.03539

Country: North America > United States (1.00)

Genre: Research Report (0.50)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Uncertainty Calibration of Multi-Label Bird Sound Classifiers

Schwinger, Raphael, McEwen, Ben, Kather, Vincent S., Heinrich, René, Rauch, Lukas, Tomforde, Sven

arXiv.org Artificial IntelligenceNov-12-2025

Passive acoustic monitoring enables large-scale biodiversity assessment, but reliable classification of bioacoustic sounds requires not only high accuracy but also well-calibrated uncertainty estimates to ground decision-making. In bioacoustics, calibration is challenged by overlapping vocalisations, long-tailed species distributions, and distribution shifts between training and deployment data. The calibration of multi-label deep learning classifiers within the domain of bioacoustics has not yet been assessed. We systematically benchmark the calibration of four state-of-the-art multi-label bird sound classifiers on the BirdSet benchmark, evaluating both global, per-dataset and per-class calibration using threshold-free calibration metrics (ECE, MCS) alongside discrimination metrics (cmAP). Model calibration varies significantly across datasets and classes. While Perch v2 and ConvNeXt$_{BS}$ show better global calibration, results vary between datasets. Both models indicate consistent underconfidence, while AudioProtoPNet and BirdMAE are mostly overconfident. Surprisingly, calibration seems to be better for less frequent classes. Using simple post hoc calibration methods we demonstrate a straightforward way to improve calibration. A small labelled calibration set is sufficient to significantly improve calibration with Platt scaling, while global calibration parameters suffer from dataset variability. Our findings highlight the importance of evaluating and improving uncertainty calibration in bioacoustic classifiers.

artificial intelligence, calibration, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2511.08261

Country: Europe (0.28)

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

Add feedback

A Evaluation Metrics A.1 Expected Calibration Error (ECE) Expected calibration error (ECE) [

Neural Information Processing SystemsAug-17-2025, 08:07:48 GMT

The difference between acc and conf can be intuitively seemed as the deviation of the outputs to the diagonal in Figure 1. The higher the accuracy of predictions is, the lower BS is. The details of these datasets are summarized in Table 5. Our data are public and do not contain personally identifiable information and offensive content. The learning rate is 0.01 and the maximum number of iteration is 50.

accuracy, artificial intelligence, machine learning, (16 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Filters

Collaborating Authors

reliability diagram

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

Calibration without labels in multiple testing

A Semantic-Sampling Framework for Evaluating Calibration in Open-Ended Question Answering

46d0671dd4117ea366031f87f3aa0093-Supplemental.pdf

Ensemble-Based Dirichlet Modeling for Predictive Uncertainty and Selective Classification

sidedCalibrationTheorem

Deep Linear Discriminant Analysis Revisited

Investigating the Multilingual Calibration Effects of Language Model Instruction-Tuning

An exact multiple-time-step variational formulation for the committor and the transition rate

Uncertainty Calibration of Multi-Label Bird Sound Classifiers

A Evaluation Metrics A.1 Expected Calibration Error (ECE) Expected calibration error (ECE) [